The purpose of this case study is to classify a given silhouette as one of four types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles. Four "Corgi" model vehicles were used for the experiment: a double decker bus, a Chevrolet van, a Saab 9000 and an Opel Manta 400. This particular combination was chosen with the expectation that the bus, the van and either one of the cars would be readily distinguishable, but that it would be more difficult to distinguish between the two cars.
The points distribution for this case is given in the section headings below.
Resources Available: https://archive.ics.uci.edu/ml/datasets/Statlog+(Vehicle+Silhouettes)
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
vehicle_df = pd.read_csv('vehicle.csv')
vehicle_df.shape
Data pre-processing - Understand the data and treat missing values and outliers (use box plots) (5 points)
vehicle_df.info()
vehicle_df.head()
vehicle_df.tail()
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
vehicle_df.isnull().sum()
Missing values account for less than 1% of the data, so we will remove those rows for our analysis.
vehicle_df = vehicle_df.dropna(axis = 0, how = 'any')
print(vehicle_df.shape)
print('*'*40)
print(vehicle_df.isnull().sum())
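Since the missing values are below 1% of the rows, dropping them is reasonable. As an alternative (not used in this notebook), per-column median imputation preserves all rows; a minimal sketch on a hypothetical toy frame (the values are invented, not from vehicle.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for vehicle_df (values invented)
df = pd.DataFrame({'radius_ratio': [150.0, np.nan, 210.0, 180.0],
                   'circularity': [40.0, 44.0, np.nan, 50.0]})

# Fill each numeric column's missing entries with that column's median
df_imputed = df.fillna(df.median(numeric_only=True))
print(df_imputed.isnull().sum().sum())  # no missing values remain
```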
vehicle_df.describe().transpose()
vehicle_df['class'].value_counts()
vehicle_df['class'].value_counts().plot(kind='bar')
vehicle_df.boxplot(figsize=(40,10))
# Checking for outliers in the continuous variables
num_vehicle_df = vehicle_df[['radius_ratio','pr.axis_aspect_ratio', 'max.length_aspect_ratio','scaled_variance','scaled_variance.1','scaled_radius_of_gyration.1', 'skewness_about', 'skewness_about.1']]
# Checking outliers at 25%,50%,75%,90%,95% and 99%
num_vehicle_df.describe(percentiles=[.25,.5,.75,.90,.95,.99])
# Clubbing levels based on the percentiles
vehicle_df['radius_ratio'] = np.where(vehicle_df['radius_ratio'] >234.880000, 234.880000, vehicle_df['radius_ratio'])
vehicle_df['pr.axis_aspect_ratio'] = np.where(vehicle_df['pr.axis_aspect_ratio'] >75.880000, 75.880000, vehicle_df['pr.axis_aspect_ratio'])
vehicle_df['max.length_aspect_ratio'] = np.where(vehicle_df['max.length_aspect_ratio'] >12.000000, 12.000000, vehicle_df['max.length_aspect_ratio'])
vehicle_df['max.length_aspect_ratio'] = np.where(vehicle_df['max.length_aspect_ratio'] <7.000000, 7.000000, vehicle_df['max.length_aspect_ratio'])
vehicle_df['scaled_variance'] = np.where(vehicle_df['scaled_variance'] >234.000000, 234.000000, vehicle_df['scaled_variance'])
vehicle_df['scaled_variance.1'] = np.where(vehicle_df['scaled_variance.1'] >726.400000, 726.400000, vehicle_df['scaled_variance.1'])
vehicle_df['scaled_radius_of_gyration.1'] = np.where(vehicle_df['scaled_radius_of_gyration.1'] >85.000000, 85.000000, vehicle_df['scaled_radius_of_gyration.1'])
vehicle_df['skewness_about'] = np.where(vehicle_df['skewness_about'] >16.000000, 16.000000, vehicle_df['skewness_about'])
vehicle_df['skewness_about.1'] = np.where(vehicle_df['skewness_about.1'] >29.000000, 29.000000, vehicle_df['skewness_about.1'])
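The repeated np.where capping above can also be expressed once with a small helper built on pandas' clip; a sketch (the helper name and quantiles here are illustrative, not from the original):

```python
import pandas as pd

def cap_outliers(s, lower_q=None, upper_q=0.99):
    """Clip a numeric series at the given quantiles; None disables that side."""
    lo = s.quantile(lower_q) if lower_q is not None else None
    hi = s.quantile(upper_q) if upper_q is not None else None
    return s.clip(lower=lo, upper=hi)

# e.g. cap a toy column at its 75th percentile
s = pd.Series([1, 2, 3, 4, 100])
capped = cap_outliers(s, upper_q=0.75)
print(capped.tolist())
```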
vehicle_df.boxplot(figsize=(40,10))
With the outliers capped at these percentiles, the boxplots now look fine.
Understanding the attributes - Find relationships between the different attributes (independent variables) and choose carefully which attributes should be part of the analysis, and why (5 points)
vehicle_df.hist(figsize=(25,15))
sns.pairplot(vehicle_df, diag_kind='kde')
sns.pairplot(vehicle_df, diag_kind='kde', hue = 'class')
Use PCA from scikit-learn and an elbow plot to find the reduced number of dimensions (covering more than 95% of the variance) - 10 points
# Drop class variables
vehicle_df_new = vehicle_df.drop(['class'], axis =1)
vehicle_df_new.head()
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from scipy.stats import zscore
X = vehicle_df.drop(['class'], axis =1)
y = vehicle_df["class"]
# Splitting the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X,y, train_size=0.7,test_size=0.3,random_state=100)
X_new=X_train.apply(zscore)
X_new.head()
X_new.boxplot(figsize=(20,3))
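Standardizing with scipy's zscore fits the statistics on whatever frame it is applied to. An equivalent pattern that learns the mean and standard deviation on the training split and reuses them on the test split is sklearn's StandardScaler; a sketch on synthetic stand-in arrays (shapes and values hypothetical):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X_tr = rng.normal(10, 3, size=(100, 2))   # stand-in for X_train
X_te = rng.normal(10, 3, size=(20, 2))    # stand-in for X_test

scaler = StandardScaler().fit(X_tr)       # statistics from the train split only
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)           # reuse train statistics on test
print(X_tr_s.mean(axis=0).round(6), X_tr_s.std(axis=0).round(6))
```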
# PCA
# Step 1 - Create covariance matrix
cov_matrix = np.cov(X_new.T)
print('Covariance Matrix\n', cov_matrix)
# Step 2- Get eigen values and eigen vector
eig_vals, eig_vecs = np.linalg.eig(cov_matrix)
print('Eigen Vectors\n', eig_vecs)
print('\nEigen Values\n', eig_vals)
tot = sum(eig_vals)
var_exp = [( i /tot ) * 100 for i in sorted(eig_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)
print("Cumulative Variance Explained", cum_var_exp)
plt.plot(var_exp)
# Ploting
plt.figure(figsize=(10 , 5))
plt.bar(range(1, eig_vals.size + 1), var_exp, alpha = 0.5, align = 'center', label = 'Individual explained variance')
plt.step(range(1, eig_vals.size + 1), cum_var_exp, where='mid', label = 'Cumulative explained variance')
plt.ylabel('Explained Variance Ratio')
plt.xlabel('Principal Components')
plt.legend(loc = 'best')
plt.tight_layout()
plt.show()
Visually we can observe that there is a steep drop in the explained variance as the number of principal components increases.
We will proceed with 9 components here, which cover more than 95% of the variance; depending on the requirement, anywhere from 6 to 9 components would also do well.
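The cut-off can also be picked programmatically: take the smallest number of components whose cumulative explained variance reaches 95%. A sketch with invented variance percentages (the real values come from cum_var_exp above):

```python
import numpy as np

# Invented explained-variance percentages, for illustration only
var_exp = [52.0, 18.0, 10.0, 7.0, 5.0, 3.0, 2.0, 1.5, 1.0, 0.5]
cum_var_exp = np.cumsum(var_exp)

# index of the first cumulative value >= 95%, converted to a component count
n_components = int(np.argmax(cum_var_exp >= 95.0)) + 1
print(n_components)  # -> 6 for these invented numbers
```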
# Using scikit learn PCA here. It does all the above steps and maps data to PCA dimensions in one shot
# NOTE - we are generating only 9 PCA dimensions
pca = PCA(n_components=9)
data_reduced = pca.fit_transform(X_train)
data_reduced.transpose()
df_comp = pd.DataFrame(pca.components_,columns=list(X_train))
df_comp.head()
plt.figure(figsize=(12,6))
sns.heatmap(df_comp,cmap='plasma',)
#Doing the PCA on the train data
pca.fit(X_train)
pca.components_
colnames = list(X_train.columns)
pcs_df = pd.DataFrame({'PC1':pca.components_[0],'PC2':pca.components_[1], 'Feature':colnames})
pcs_df.head(10)
pca.explained_variance_ratio_
#Making the screeplot - plotting the cumulative variance against the number of components
%matplotlib inline
fig = plt.figure(figsize = (12,8))
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
plt.show()
Looks like 9 components are enough to describe more than 95% of the variance in the dataset. We'll choose 9 components for our modeling.
#Using incremental PCA for efficiency - saves a lot of time on larger datasets
from sklearn.decomposition import IncrementalPCA
pca_final = IncrementalPCA(n_components=9)
df_train_pca = pca_final.fit_transform(X_train)
df_train_pca.shape
df_train_pca
#creating correlation matrix for the principal components
corrmat = np.corrcoef(df_train_pca.transpose())
#plotting the correlation matrix
%matplotlib inline
plt.figure(figsize = (20,10))
sns.heatmap(corrmat,annot = True)
# 1s -> 0s in diagonals
corrmat_nodiag = corrmat - np.diagflat(corrmat.diagonal())
print("max corr:",corrmat_nodiag.max(), ", min corr: ", corrmat_nodiag.min(),)
# we see that correlations are indeed very close to 0
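These near-zero correlations are expected by construction: projecting centered data onto the eigenvectors of its covariance matrix yields uncorrelated scores. A numpy-only check on synthetic correlated data (all names hypothetical):

```python
import numpy as np

rng = np.random.RandomState(1)
A = rng.normal(size=(200, 3))
M = np.array([[1.0, 0.5, 0.0],
              [0.0, 1.0, 0.5],
              [0.0, 0.0, 1.0]])
X = A @ M                               # correlated features
Xc = X - X.mean(axis=0)                 # center the data
_, vecs = np.linalg.eigh(np.cov(Xc.T))  # eigenvectors of the covariance
Z = Xc @ vecs                           # principal-component scores
off_diag = np.corrcoef(Z.T) - np.eye(3)
print(abs(off_diag).max())              # numerically ~0
```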
#Applying selected components to the test data - 9 components
df_test_pca = pca_final.transform(X_test)
df_test_pca.shape
Use Support Vector Machines with grid search (try C values 0.01, 0.05, 0.5, 1 and kernel = linear, rbf) to find the best hyperparameters, and use cross-validation to find the accuracy. (10 points)
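The grid search below fixes kernel='rbf' and searches over gamma; the task statement also asks for the kernel itself (linear vs. rbf) to be part of the grid. One way to do that, sketched on synthetic stand-in data rather than the vehicle set:

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the vehicle features (not the real data)
X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     n_informative=5, random_state=0)

param_grid = {'C': [0.01, 0.05, 0.5, 1],
              'kernel': ['linear', 'rbf']}      # search the kernel too
grid = GridSearchCV(svm.SVC(), param_grid=param_grid,
                    scoring='accuracy', cv=5)
grid.fit(X_demo, y_demo)
print(grid.best_params_)
```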
Let's first try building a linear SVM model (i.e. a linear kernel).
from sklearn import svm
from sklearn import metrics
# an initial SVM model with linear kernel
svm_linear = svm.SVC(kernel='linear')
# fit
svm_linear.fit(X_train, y_train)
# predict
predictions = svm_linear.predict(X_test)
predictions[:10]
# evaluation: accuracy
# C(i, j) represents the number of points known to be in class i
# but predicted to be in class j
confusion = metrics.confusion_matrix(y_true = y_test, y_pred = predictions)
confusion
# measure accuracy
metrics.accuracy_score(y_true=y_test, y_pred=predictions)
# class-wise accuracy
class_wise = metrics.classification_report(y_true=y_test, y_pred=predictions)
print(class_wise)
Let's now try a non-linear model with the RBF kernel.
# rbf kernel with other hyperparameters kept to default
svm_rbf = svm.SVC(kernel='rbf')
svm_rbf.fit(X_train, y_train)
# predict
predictions = svm_rbf.predict(X_test)
# accuracy
print(metrics.accuracy_score(y_true=y_test, y_pred=predictions))
The accuracy achieved with the RBF kernel is much lower than with the linear one.
# conduct (grid search) cross-validation to find the optimal values
# of cost C and the choice of kernel
from sklearn.model_selection import GridSearchCV
parameters = {'C':[0.01, 0.05, 0.5, 1],
'gamma': [1e-2, 1e-3, 1e-4]}
# instantiate a model
svc_grid_search = svm.SVC(kernel="rbf")
# create a classifier to perform grid search
clf = GridSearchCV(svc_grid_search, param_grid=parameters, scoring='accuracy')
# fit
clf.fit(X_train, y_train)
# results
cv_results = pd.DataFrame(clf.cv_results_)
cv_results
# converting C to numeric (float) type for plotting on x-axis
cv_results['param_C'] = cv_results['param_C'].astype('float')
# plotting
plt.figure(figsize=(16,6))
# subplot 1/3
plt.subplot(131)
gamma_01 = cv_results[cv_results['param_gamma']==0.01]
plt.plot(gamma_01["param_C"], gamma_01["mean_test_score"])
plt.plot(gamma_01["param_C"], gamma_01["mean_train_score"])
plt.xlabel('C')
plt.ylabel('Accuracy')
plt.title("Gamma=0.01")
plt.ylim([0.60, 1])
plt.legend(['test accuracy', 'train accuracy'], loc='lower right')
plt.xscale('log')
# subplot 2/3
plt.subplot(132)
gamma_001 = cv_results[cv_results['param_gamma']==0.001]
plt.plot(gamma_001["param_C"], gamma_001["mean_test_score"])
plt.plot(gamma_001["param_C"], gamma_001["mean_train_score"])
plt.xlabel('C')
plt.ylabel('Accuracy')
plt.title("Gamma=0.001")
plt.ylim([0.60, 1])
plt.legend(['test accuracy', 'train accuracy'], loc='lower right')
plt.xscale('log')
# subplot 3/3
plt.subplot(133)
gamma_0001 = cv_results[cv_results['param_gamma']==0.0001]
plt.plot(gamma_0001["param_C"], gamma_0001["mean_test_score"])
plt.plot(gamma_0001["param_C"], gamma_0001["mean_train_score"])
plt.xlabel('C')
plt.ylabel('Accuracy')
plt.title("Gamma=0.0001")
plt.ylim([0.60, 1])
plt.legend(['test accuracy', 'train accuracy'], loc='lower right')
plt.xscale('log')
plt.show()
# optimal hyperparameters (chosen from the validation curves above)
best_C = 1
best_gamma = 0.001
# model
svm_final = svm.SVC(kernel='rbf', C=best_C, gamma=best_gamma)
# fit
svm_final.fit(X_train, y_train)
# predict
predictions = svm_final.predict(X_test)
# evaluation: CM
confusion = metrics.confusion_matrix(y_true = y_test, y_pred = predictions)
# measure accuracy
test_accuracy = metrics.accuracy_score(y_true=y_test, y_pred=predictions)
print(test_accuracy, "\n")
print(confusion)
Let's first try building a linear SVM model (i.e. a linear kernel).
from sklearn import svm
from sklearn import metrics
# an initial SVM model with linear kernel
svm_linear = svm.SVC(kernel='linear')
# fit
svm_linear.fit(df_train_pca, y_train)
# predict
predictions = svm_linear.predict(df_test_pca)
predictions[:10]
# evaluation: accuracy
# C(i, j) represents the number of points known to be in class i
# but predicted to be in class j
confusion = metrics.confusion_matrix(y_true = y_test, y_pred = predictions)
confusion
# measure accuracy
metrics.accuracy_score(y_true=y_test, y_pred=predictions)
# class-wise accuracy
class_wise = metrics.classification_report(y_true=y_test, y_pred=predictions)
print(class_wise)
Let's now try a non-linear model with the RBF kernel.
# rbf kernel with other hyperparameters kept to default
svm_rbf = svm.SVC(kernel='rbf')
svm_rbf.fit(df_train_pca, y_train)
# predict
predictions = svm_rbf.predict(df_test_pca)
# accuracy
print(metrics.accuracy_score(y_true=y_test, y_pred=predictions))
The accuracy achieved with the RBF kernel is much lower than with the linear one.
# conduct (grid search) cross-validation to find the optimal values
# of cost C and the choice of kernel
from sklearn.model_selection import GridSearchCV
parameters = {'C':[0.01, 0.05, 0.5, 1],
'gamma': [1e-2, 1e-3, 1e-4]}
# instantiate a model
svc_grid_search = svm.SVC(kernel="rbf")
# create a classifier to perform grid search
clf = GridSearchCV(svc_grid_search, param_grid=parameters, scoring='accuracy')
# fit
clf.fit(df_train_pca, y_train)
# results
cv_results = pd.DataFrame(clf.cv_results_)
cv_results
# converting C to numeric (float) type for plotting on x-axis
cv_results['param_C'] = cv_results['param_C'].astype('float')
# plotting
plt.figure(figsize=(16,6))
# subplot 1/3
plt.subplot(131)
gamma_01 = cv_results[cv_results['param_gamma']==0.01]
plt.plot(gamma_01["param_C"], gamma_01["mean_test_score"])
plt.plot(gamma_01["param_C"], gamma_01["mean_train_score"])
plt.xlabel('C')
plt.ylabel('Accuracy')
plt.title("Gamma=0.01")
plt.ylim([0.60, 1])
plt.legend(['test accuracy', 'train accuracy'], loc='lower right')
plt.xscale('log')
# subplot 2/3
plt.subplot(132)
gamma_001 = cv_results[cv_results['param_gamma']==0.001]
plt.plot(gamma_001["param_C"], gamma_001["mean_test_score"])
plt.plot(gamma_001["param_C"], gamma_001["mean_train_score"])
plt.xlabel('C')
plt.ylabel('Accuracy')
plt.title("Gamma=0.001")
plt.ylim([0.60, 1])
plt.legend(['test accuracy', 'train accuracy'], loc='lower right')
plt.xscale('log')
# subplot 3/3
plt.subplot(133)
gamma_0001 = cv_results[cv_results['param_gamma']==0.0001]
plt.plot(gamma_0001["param_C"], gamma_0001["mean_test_score"])
plt.plot(gamma_0001["param_C"], gamma_0001["mean_train_score"])
plt.xlabel('C')
plt.ylabel('Accuracy')
plt.title("Gamma=0.0001")
plt.ylim([0.60, 1])
plt.legend(['test accuracy', 'train accuracy'], loc='lower right')
plt.xscale('log')
plt.show()
# optimal hyperparameters (chosen from the validation curves above)
best_C = 1
best_gamma = 0.001
# model
svm_final = svm.SVC(kernel='rbf', C=best_C, gamma=best_gamma)
# fit
svm_final.fit(df_train_pca, y_train)
# predict
predictions = svm_final.predict(df_test_pca)
# evaluation: CM
confusion = metrics.confusion_matrix(y_true = y_test, y_pred = predictions)
# measure accuracy
test_accuracy = metrics.accuracy_score(y_true=y_test, y_pred=predictions)
print(test_accuracy, "\n")
print(confusion)
Comparing the SVM results on the original (scaled) dataset against the PCA-reduced dataset:
On the original dataset:
SVM linear model: 94%
SVM non-linear (RBF, default parameters) model: 49%
SVM grid search CV (RBF kernel): 88%
SVM with optimal RBF parameters: 87%
On the PCA-reduced dataset:
SVM linear model: 77%
SVM non-linear (RBF, default parameters) model: 49%
SVM grid search CV (RBF kernel): 86%
SVM with optimal RBF parameters: 84%
For this test, the grid search CV model (C = 1, gamma = 0.001, comparing linear and rbf kernels) can be considered the best model.